Abstract

It was suggested that the extremity of the scale values
associated with standards used to represent effective and
ineffective performance in Mixed Standard Scales (MSS) may affect
the nature of performance ratings derived from MSS responses and
decisions based on MSS ratings. When the extremity of standards
was experimentally manipulated, it was found that standard
extremity affects both the level of performance ratings and the
proportion of logically inconsistent response patterns observed.
In addition, standard extremity appears to affect the rankings of
ratees on performance. The implications of these observations for
the development of Mixed Standard Scales were discussed.




Effects of Standard Extremity on Mixed Standard
Scale Performance Ratings

In 1972, Blanz and Ghiselli introduced the Mixed Standard Scale
(MSS) approach to rating employee performance. Like the more
popular BARS approach, the MSS procedure assumes that raters will
make more accurate and reliable judgments about the levels at
which their employees are performing if they are provided with
descriptions of the kinds of behaviors characterizing effective
and ineffective performance on each performance dimension. Unlike
BARS, the MSS is a derived scale, in which neither the performance
dimension nor the effectiveness level of anchor statements is
provided to raters when they use the scales. Rather than being
asked to compare each ratee to a continuum of performance
effectiveness for each dimension, the rater is asked to compare
the ratee's performance to a series of statements (standards)
representing varying levels of performance effectiveness and
varying performance dimensions. The standards are "mixed"
(presented in a random order) so that neither the effectiveness
levels nor the performance dimensions they represent are readily
apparent to the rater. For each statement, the rater must decide
whether ratee performance equals, surpasses, or is less effective
than the performance level exemplified in the standard. The
patterns of responses to the standards representing each
performance dimension are then transformed into dimension ratings
on a 7-point scale.

Because the underlying rating scale is disguised from the rater,
Blanz and Ghiselli expected that such rater biases as leniency and
halo would be reduced. In addition, since mixed standard scales
are assumed to have Guttman properties, the patterns of responses
that raters exhibit can be indexed in terms of their logical
consistency. Raters with high levels of logical inconsistency can
be identified, and perhaps be given special attention or training.
Likewise, ratees for whom high levels of logical error are
observed can be identified.

Despite the stated advantages of the format, the MSS approach to
rating employee performance has received only intermittent
attention from industrial psychologists in the last ten years.
Most examinations of the MSS format have focused on one of three
issues:

a) difficulties associated with deriving a consistent coding
system for transforming item responses into dimension
ratings (Saal, 1979);

b) the effect of anchor content and developmental procedures
on the psychometric characteristics of ratings obtained
with MSS (Dickinson & Zellinger, 1980);

c) comparisons of the psychometric characteristics of ratings
obtained from MSS and other rating formats (Arvey & Hoyle,
1974; Dickinson & Zellinger, 1980; Finley, Osburn, Dubin, &
Jeanneret, 1977; Saal & Landy, 1977; Saal, 1979).

In general, evaluations of the MSS format have been mixed. While
most examinations of leniency have concluded that the mixed
standard scale format performs at least as well as, and sometimes
better than, the BARS format or simple graphic rating scale
(Finley et al., 1977; Saal, 1979; Saal & Landy, 1977), conclusions
regarding the relative effectiveness of the MSS format in reducing
levels of halo have been inconsistent, sometimes favoring MSS
(Saal, 1979; Saal & Landy, 1977), and sometimes favoring BARS
(Arvey & Hoyle, 1974; Finley et al., 1977). Lack of inter-rater
reliability does seem to be a consistent problem with the MSS
(Arvey & Hoyle, 1974; Finley et al., 1977; Saal, 1979; Saal &
Landy, 1977). However, the convergent and discriminant validity of
ratings obtained with a mixed standard format appears to be
acceptable and equivalent to that observed in ratings obtained
with a BARS format, as long as similar developmental procedures
(i.e., behavioral anchors and retranslation of expectations) are
used to produce the scales (Arvey & Hoyle, 1974; Dickinson &
Zellinger, 1980).

Since the number of behavioral examples anchoring each performance
dimension is very small, and raters are provided no information
about the relative or absolute performance levels that these
anchors are intended to represent, the nature and underlying scale
values of the anchors chosen to describe a performance dimension
have potentially important implications for the ways in which
raters respond to the instrument, and the ratings that are derived
from those responses. Yet little is known about the manner in
which anchors are chosen for mixed standard scales, or the
influence that this aspect of the development process might have
on performance descriptions. The current study addresses this
issue. Specifically, we were interested in the impact of anchor
selection procedures on the ratings obtained with a rating
instrument that utilizes a mixed standard format.

Consider the typical recommended procedure for constructing a
mixed standard scale:

Step 1: Generate and define performance dimensions to be
evaluated.

Step 2: Generate critical incidents describing different
levels of performance for each dimension
(anchors, or "standards").

Step 3: For each performance dimension, choose a standard
representing high, moderate, and low levels of
effectiveness, respectively.

Step 4: Prepare a final list of performance standards
which has been mixed across performance
dimensions and across performance levels (i.e., if
there are X dimensions, the final list will have
3X performance standards to which the rater must
respond).
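
Steps 3 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dimension names, the statement texts, and the `build_mss` helper are hypothetical, and a seeded shuffle stands in for whatever mixing procedure a scale developer actually uses.

```python
import random


def build_mss(standards_by_dimension, seed=0):
    """Return a shuffled list of (dimension, level, statement) tuples.

    standards_by_dimension maps each dimension name to a dict holding
    one chosen statement for each of the levels "H", "M", and "L"
    (the outcome of Step 3).
    """
    items = []
    for dimension, levels in standards_by_dimension.items():
        for level in ("H", "M", "L"):
            items.append((dimension, level, levels[level]))
    # Step 4: mix across dimensions and levels so that neither is
    # apparent from the order of presentation.
    rng = random.Random(seed)
    rng.shuffle(items)
    return items


# Hypothetical placeholder content, not statements from the study.
example = {
    "Dimension A": {"H": "A high", "M": "A moderate", "L": "A low"},
    "Dimension B": {"H": "B high", "M": "B moderate", "L": "B low"},
}
mixed = build_mss(example)
print(len(mixed))  # X = 2 dimensions gives 3X = 6 standards
```

Only the statement text would be shown to raters; the dimension and level labels are kept so that responses can later be regrouped by dimension for scoring.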

Our questions grew out of a consideration of Step 3. Since ratings
obtained with a Mixed Standard Scale are derived scores, the scale
values of the standards to which raters respond are ignored once
the standard has been assigned to the category high (H), moderate
(M), or low (L). Instead, dimension ratings are based on the
rater's patterns of responses to the chosen standards for each
dimension. However, no standard decision rules have been presented
to guide the instrument developer in selecting standards to
represent the various performance levels (with the exception, of
course, that the resulting scales should have Guttman properties).
For instance, if we think of the various standards as representing
various levels on a seven-point scale, the developer might choose
standards with scale values of 6, 4, and 2 to represent the
categories H, M, and L respectively. Alternatively, he/she might
choose standards with scale values of 7, 4, and 1 to represent the
same categories. The mixed standard scales produced by these
decision rules vary in terms of: (1) the extremity of scale values
underlying each rating dimension; and (2) the amount of scale
separation among standards representing different levels of
performance. (The two are of course not independent of one
another, since the extremity of the scale values constrains the
amount of scale separation among standards.) These variations may
affect rater responses to the scales, with implications for the
psychometric characteristics of the ratings obtained, and for
decisions which are based upon those ratings.
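
To make the two properties concrete, the sketch below computes one possible index of each for the two example sets of scale values. The text does not define numerical indices, so both functions are assumptions made for illustration: extremity is taken as the mean distance of the H and L standards from the midpoint of a seven-point scale, and separation as the mean distance between adjacent standards.

```python
def extremity(h, m, l, midpoint=4):
    """Mean absolute distance of the H and L standards from the midpoint."""
    return (abs(h - midpoint) + abs(l - midpoint)) / 2


def separation(h, m, l):
    """Mean distance between adjacent standards (H to M and M to L)."""
    return ((h - m) + (m - l)) / 2


for values in [(6, 4, 2), (7, 4, 1)]:
    print(values, extremity(*values), separation(*values))
# (6, 4, 2) 2.0 2.0
# (7, 4, 1) 3.0 3.0
```

With the moderate standard fixed at the midpoint, both indices rise together, which is the dependence between extremity and separation noted above.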

First, consider the way in which extremity of chosen standards
(hereafter referred to as "standard extremity") might influence
the level of ratings assigned with a MSS. Raters are asked to
decide whether each ratee performs at (0), above (+), or below (-)
the level of each standard presented to them. The pattern of three
responses to each dimension is transformed into a rating on a
seven-point scale. The probability of responding +, -, or 0 to a
particular standard should be affected by the performance level of
the ratee. However, it will also be constrained by the extremity
of the anchors chosen to represent high and low performance.
Behaviors at the extreme ends of the scale will be relatively
rare. Raters who are responding to standards chosen from the
extreme ends would thus be less likely to have observed behaviors
at those levels than would raters responding to less extreme
standards. As such, we would expect that raters using a mixed
standard scale comprised of high performance standards (H) with
scale values of 7 would be less likely to respond with the pattern
of responses which is transformed to a rating of 7 (+++) than
would raters using a scale with high performance standards having
a scale value of 6. A similar situation should occur when raters
attempt to provide ratings of low performance levels. This would
result in decreased variability in assigned ratings but no change
in the level of ratings if performance is normally distributed;
but generally this is not the case. Typically, actual performance
distributions in organizations are negatively skewed. When the
distribution is skewed, we would expect standard extremity to have
a linear effect on the level of ratings assigned. The implication
for the distribution of ratings derived from rater response
patterns will be increased central tendency in ratings gathered
from mixed standard scales whose performance standards have
extreme scale values.

The amount of scale separation among performance standards, on the
other hand, would be expected to affect the degree to which raters
are able to reliably differentiate among performance levels.
Performance standards representing scale values of 1, 4, and 7
should be more readily distinguished and rank-ordered than
performance standards with scale values of 2, 4, and 6, for
example. The latter are perceptually more similar to one another
in terms of performance level. As a result, we might expect to see
an increase in the frequency of logical errors present in ratings
as the distance between anchor statements decreases.

The current study tested both of these hypotheses about the effect
of developmental procedures on the characteristics of ratings
obtained when a MSS is used to evaluate performance. In addition,
two other issues were examined. Since one of the intended
advantages of the "mixed" format of the MSS is the reduction of
halo, it is reasonable to ask whether anchor extremity (and the
resulting change in anchor separation) affects halo. Finally, we
thought it important to consider the practical implications that
anchor selection procedures might have for decisions made on the
basis of inter-individual comparisons. For example, when a
promotion decision is being made by a supervisor or personnel
department, most often the task is one of rank-ordering eligible
employees in terms of some criterion of performance effectiveness
or potential to perform. When we use a mixed standard scale to
differentiate among employees in this way, does the extremity of
the performance standards in the MSS affect the rank-ordering of
ratees, and ultimately the decisions of the organization?


METHOD 


Subjects. Subjects were 248 students recruited from the classes of
seven introductory psychology instructors who agreed to
participate in this study. Participation in the study was
voluntary.

Materials. Three mixed standard scales for the evaluation of
teacher performance were prepared from a pool of statements
previously developed for use in behavioral expectation scales by
Harari and Zedeck (1973). These materials were chosen specifically
because they represented an example of performance appraisal
scales that met several important criteria:

1) Behaviorally anchored - the content of the anchors was
behavioral and specific;

2) Rigorous development procedures - the Harari and Zedeck
scales were carefully developed using the retranslation of
expectations (RE) technique to eliminate anchors which
were not unambiguous examples of performance dimensions,
and a second screening to eliminate those anchors for
which there was disagreement about the effectiveness level
(scale value) represented;

3) Multiple anchor points - the behavioral anchors
represented a range of scale values which could be easily
translated into mixed standard scales having the
variations in standard extremity that we required to test
our hypothesis;

4) Invariance of scale values - because the behavioral
anchors used in this study were developed and scaled in
another setting, there was some concern that the scale
values might not generalize to settings other than the one
in which the scales were developed. However, a study by
Landy and Barnes (1979), which used statements from the
same pool, indicated that the mean scale values assigned
to the behavioral statements developed by Harari and
Zedeck (1973) did not change when those statements were
rescaled several years later at a second university.

Each MSS was composed of a total of twelve statements representing
high, moderate, and low levels of performance effectiveness in
four areas: Delivery, Ability to Motivate Students, Depth of
Knowledge, and Interpersonal Relations with Students. Information
about the scale values of behavioral statements defining each
dimension was used to construct three mixed standard scales. MSS I
(HE) was composed of statements reflecting maximally extreme scale
values for each of the four dimensions represented on the
appraisal instrument. MSS II (ME) was composed of statements with
moderately extreme scale values. MSS III (LE) was composed of
statements with minimally extreme scale values. The scale values
associated with the standards comprising each of the three mixed
standard scales are presented in Table 1.

Insert Table 1 about here


Procedure. Students of the seven introductory psychology
instructors who agreed to participate in this study were randomly
assigned to one of three conditions: High Extremity (HE), Moderate
Extremity (ME), or Low Extremity (LE). All subjects were told that
they would be evaluating the performance of their instructor, and
that their evaluations would be used to provide feedback to their
instructor about his/her performance strengths and weaknesses.
Each subject rated only one instructor. Each instructor was
evaluated by 23 to 53 students.

Analyses. Responses to the mixed standard scales were coded to
produce performance dimension ratings on a 7-point scale, using
the coding scheme suggested by Saal (1979).

Means and standard deviations for assigned dimension ratings were
calculated for each of the experimental conditions (HE, ME, LE)
and for each ratee. In addition, dimension intercorrelation
matrices were constructed for each of the three experimental
conditions. Finally, a simple tally of the number of inconsistency
errors (response patterns inconsistent with the scaled order of
standards for each dimension) was computed for each experimental
condition.

To test the hypothesis that standard extremity affects the level
of central tendency, central tendency was operationalized as a
level effect. A multivariate analysis of variance (MANOVA) linear
trend analysis was performed to test the effect of standard
extremity on performance ratings. This was followed by one-way
analyses of variance (ANOVAs) for each performance dimension,
using standard extremity as the independent variable, and assigned
performance rating as the dependent variable.

To test the hypothesis that the amount of "logical" error is
affected by standard extremity, all dimension ratings were scored
as consistent (no logical error = 0) or inconsistent (logical
error = 1). That is, if the set of responses to a dimension was
one of the 7 logically consistent response combinations, the
rating derived from that set of responses was said to be logically
consistent. A rating derived from any one of the 20 logically
inconsistent response combinations was said to be logically
inconsistent. Although there are many ways to provide consistent
ratings (7 ways) and inconsistent ratings (20 ways), each rater
could only commit one logical error per performance dimension. A
one-way ANOVA using standard extremity as the independent variable
and proportion of logical errors in the observed ratings as the
dependent variable was performed for each performance dimension
after a MANOVA for linear trends was used to test the effect of
standard extremity on logical errors.
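
The counts of 7 consistent and 20 inconsistent combinations can be checked by enumerating all 27 possible response triples. The sketch below assumes a Guttman-style rule (a ratee rated at or above a higher standard must be rated above every lower standard, and can equal at most one standard) and assumes the usual mapping in which +++ yields 7 and --- yields 1. The function names are hypothetical, and the handling of inconsistent patterns under the Saal (1979) coding scheme is not reproduced here.

```python
from itertools import product

# Ordering of responses: below (-) < at (0) < above (+).
ORDER = {"-": 0, "0": 1, "+": 2}


def is_consistent(pattern):
    """pattern = (response to H, response to M, response to L)."""
    scores = [ORDER[r] for r in pattern]
    # Responses must not get worse when moving to a lower standard.
    nondecreasing = scores[0] <= scores[1] <= scores[2]
    # A ratee cannot perform exactly at two different levels.
    return nondecreasing and pattern.count("0") <= 1


def rating(pattern):
    """Rating from 1 to 7 for a consistent pattern, None otherwise."""
    if not is_consistent(pattern):
        return None
    return sum(ORDER[r] for r in pattern) + 1


patterns = list(product("+0-", repeat=3))
consistent = [p for p in patterns if is_consistent(p)]
print(len(consistent), len(patterns) - len(consistent))  # 7 20
print(rating(("+", "+", "+")))  # 7
```

A per-dimension error score of 0 or 1, as used in the analysis, is then simply `0 if is_consistent(pattern) else 1`.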

In order to examine the question of whether halo is affected by
standard extremity, halo was operationalized in two ways. The
first index of halo was defined as the mean intercorrelation
between dimension ratings assigned in each condition. To compute
mean intercorrelation levels, a Fisher Z transformation was
applied to the zero-order intercorrelation matrices. A chi-square
test for homogeneity was used to test the hypothesis that levels
of halo are different for different experimental conditions. Halo
was also operationalized as the standard deviation of each rater's
ratings across the four performance dimensions (where high
standard deviations indicate low halo levels). In order to use
standard deviations as data points, a log transformation was
applied. A one-way ANOVA was performed, using standard extremity
as the independent variable.
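
Both halo indices are simple to compute. The sketch below is a minimal illustration, not the authors' procedure: the helper names are hypothetical, the sample standard deviation (n - 1 denominator) and the natural logarithm are assumptions because the text does not say which were used, and returning None for a rater with zero variation is a placeholder choice since the log of zero is undefined.

```python
import math
from statistics import stdev


def mean_correlation(rs):
    """Average correlations via Fisher's z: atanh each r, average, tanh back."""
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))


def log_sd(ratings):
    """Natural log of the SD of one rater's ratings across dimensions.

    Returns None when all ratings are identical (SD of zero).
    """
    sd = stdev(ratings)
    return math.log(sd) if sd > 0 else None


# Hypothetical values, not data from the study.
print(mean_correlation([0.30, 0.40, 0.50]))
print(log_sd([5, 6, 4, 7]))
```

Averaging in the z metric rather than averaging the raw correlations avoids the bias that comes from the skewed sampling distribution of r.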

Finally, the practical implications of variations in standard
extremity were examined by rank-ordering instructors on the basis
of the mean performance ratings assigned to them for each
dimension and the overall mean ratings assigned to them. A rank
order correlation between HE ratings and ME ratings was computed
for each dimension and for the overall mean summated ratings. The
same comparison was made between ME ratings and LE ratings, and
between HE ratings and LE ratings. Since the number of teachers
being ranked was small (n = 7), tau rather than Spearman's rho was
used (Thorndike, 1978). However, tau ranges from -1.0 to +1.0 and
is interpreted in the same manner as rho.
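
For readers unfamiliar with tau, the following sketch computes it from concordant and discordant pairs. It implements the tau-a variant, which assumes no tied ranks; the text does not say which variant was used, and the two rank orders shown are hypothetical, not the study's data.

```python
def kendall_tau(x, y):
    """Kendall's tau-a for two equal-length sequences without ties."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Positive product: the pair is ordered the same way in both.
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical ranks of seven instructors under two conditions.
he_ranks = [1, 2, 3, 4, 5, 6, 7]
me_ranks = [2, 1, 3, 4, 5, 7, 6]
print(kendall_tau(he_ranks, me_ranks))  # 17/21, about 0.81
```

With n = 7 there are only 21 pairs, so swapping two adjacent instructors changes tau by about .10, which is worth keeping in mind when reading coefficients based on so few ratees.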


RESULTS 

A MANOVA using performance ratings on all four performance
dimensions as dependent variables and standard extremity as an
independent variable indicated that standard extremity
significantly affects the level of assigned ratings. Further, the
analysis indicated a significant linear trend (F approximation for
Pillai-Bartlett V = 4.b4; df = 4,240; p < .01). The results of
follow-up univariate ANOVAs conducted separately for each
performance dimension can be seen in Table 2.

Insert  Table  2  about  here 

Standard extremity had a significant effect (p < .001) on the
level of performance ratings assigned for two performance
dimensions, Ability to Motivate and Depth of Knowledge, and a
marginally significant effect for the remaining two dimensions:
Delivery (p < .09) and Interpersonal Relations with Students
(p < .06). An examination of the cell means for each dimension
(also shown in Table 2) indicates a pattern of results generally
consistent with our hypothesis that central tendency will increase
as the extremity of scale values underlying standards of high and
low performance increases. For all four dimensions, mean ratings
were closest to the center of the scale for the high extremity
condition. Post hoc linear trend analyses showed significant
linear trends in the data for the first three dimensions
(p < .0b) and a marginally significant linear component for the
fourth dimension (p < .08).

In addition to the analysis for the total sample, a similar
analysis was performed on that subset of rater responses
consisting only of those ratings representing logical response
patterns. This was done in order to explore what effects standard
extremity might have when ratings are uncontaminated by the error
variance introduced when raters respond in logically inconsistent
ways. In other words, we were interested in identifying whether a
level effect would still be observed for those cases in which the
MSS was used as it was intended to be used, free from logical
inconsistency errors. We found that this secondary analysis makes
the pattern of results even clearer (see Table 3). For all three
dimensions in which a significant main effect was observed
(Delivery, Ability to Motivate, and Depth of Knowledge) the means
were ordered in the expected pattern, and significant linear
trends were found (p < .01).

Insert Table 3 about here

A MANOVA using standard extremity as the independent variable and
proportions of logical inconsistency error in each of the four
performance dimensions as dependent variables also supported the
hypothesis that the amount of logical error present in MSS ratings
is affected by standard extremity. As with the previous analyses,
the effect of the independent variable had a significant linear
component (F approximation for Pillai-Bartlett V = 3.68;
df = 4,242; p < .01). However, as can be seen in Table 4,
univariate ANOVAs conducted for each of the four performance
dimensions indicated a significant main effect for only one
dimension: Ability to Motivate (p < .001). The cell means for
Ability to Motivate show decreasing levels of logical
inconsistency error as standard extremity increases, as predicted
(linear trend, F = 13.38, df = 1,245, p < .01).

Insert Table 4 about here

Halo was not affected by standard extremity. The mean
intercorrelation between dimension ratings ranged from r = .34 to
r = .38 (χ² = .10, df = 7, n.s.) for the three experimental
conditions. The standard deviation of each rater's assigned
ratings across the four dimensions ranged from 1.39 to 1.57
(F = .46; df = 2,243; n.s.).

Finally, examination of the rank-ordering of instructors which is
produced by performance ratings provides evidence that the
extremity of scale anchors in a MSS affects the rank order of
instructors. Values of tau summarizing the similarity of
rank-orderings produced under different experimental conditions
are reported in Table 5. Examination of this table reveals that
the magnitude of the tau statistic for the rank-order comparisons
was low. Only 3 of the 15 tau coefficients calculated were
significantly greater than 0. None of the comparisons between HE
and LE conditions or between ME and LE conditions produced
significant rank-order associations. The only significant
correlations were observed in rank orderings produced in the HE
and ME conditions, which were similar for Delivery, Ability to
Motivate Students, and Overall Mean Rating (tau = .71, .90, and
.71, respectively).

Insert Table 5 about here

DISCUSSION

The results of this study generally supported our hypotheses that
the extremity of the scale values associated with standards chosen
in the development of mixed standard scales affects 1) the level
of ratings assigned, 2) the number of logically inconsistent
response patterns which are exhibited, and 3) the relative
position of respondents in performance distributions.

Support for the first hypothesis was relatively consistent,
indicating a tendency for ratings to be assigned closer to the
center of the scale as standard extremity increased. This effect
became more pronounced when we examined the subset of ratings
which conformed to one of the seven logically consistent response
patterns. Presumably, these ratings are free of some of the
"noise" contaminating the full set of ratings. Yet it is apparent
that the noise introduced when raters respond to mixed standard
scales in logically inconsistent ways only masks the underlying
phenomenon to some extent. Thus, attempts to improve the quality
of ratings by training raters to respond carefully (in logically
consistent patterns) will only make developmental issues like this
one more important. From a practical standpoint, the organization
in the process of developing or revising a performance appraisal
instrument using a mixed standard format can use this information
to advantage. For example, if positive leniency is a problem,
careful attention should be paid to choosing high and low
effectiveness standards with scale values falling as close to the
extreme ends of the scale as possible. In any case, the instrument
developer should be aware of the fact that not all examples of
highly effective (or highly ineffective) performance are
necessarily equivalent, and that the choice of standards that is
made may affect the level of ratings assigned. This issue might be
of particular importance for performance appraisal systems in
which an attempt is made to measure several dimensions of employee
performance and then form a profile of employee strengths and
weaknesses. If attention is not paid to the issue of underlying
scale values, the rank ordering of an employee's weaknesses might
reflect a rank ordering of performance dimensions on the basis of
standard extremity rather than a measure of employee performance
on dimension A relative to dimension B, etc. As we pointed out in
the introduction, standard extremity would only be expected to
affect the level of assigned ratings when the distribution of
actual performance is skewed. Although we have no way of
determining whether this was the case in the sample of ratees that
we observed, there is good reason to believe that negatively
skewed performance distributions are typical in organizations (cf.
Bernardin & Pence, 1980) and in samples of teachers in particular
(Zedeck, Jacobs, & Kafry, 1976).

Support for our hypothesis regarding the effect of standard
extremity on the proportion of logically inconsistent response
patterns observed was weaker. The expected increase in logical
inconsistency errors as the scale separation of standards
decreases was only observed for one dimension: Ability to Motivate
Students. While the choice of standard did not influence the error
rate as expected, it is significant to note the high frequency of
these "error" responses. Even when using scales that have been
carefully developed using retranslation of expectations procedures
to ensure that standards unambiguously represent performance
dimensions, and response scaling to ensure that raters agree on
the effectiveness level represented by each anchor, approximately
half of the performance ratings collected were derived from
patterns of responses that were, in one way or another, logically
inconsistent. The high frequency of logical errors may be, in
part, a reflection of the motivation of student raters to do a
careful job in evaluating their instructors' performance and
recording those evaluations. (It was for this reason that we felt
that the secondary analysis of the rating level effect data was
necessary and useful.) On the other hand, low motivation may be
typical in many organizational settings. Since little normative
data on the frequency of inconsistent response patterns is
available in the published literature, it is difficult to say
whether our data were contaminated by an unusually large
proportion of illogical responses or whether the "error" rate in
our data is similar to that obtained in other studies.

Because each rater evaluated the performance of only one
instructor, it is difficult to assess the degree to which
logical errors were primarily a rater effect rather than an
instrument effect or a ratee effect. However, an examination
of the intercorrelation matrix summarizing the relationships
between error scores on different dimensions (i.e., error or
no error, since a rater can only make one error per dimension)
revealed significant but small correlations (mean correlation
between dimensions (f) = +.14, χ² = 24.6, p < .001). That is,
raters who respond with inconsistent patterns on one dimension
are slightly more likely to make logical errors on other
dimensions. Still, the very small proportion of raters who
provided a complete set of responses free from logical
inconsistency errors (only 9% of the sample: 21 of 248
raters) suggests that inconsistency errors are a rather general
feature of this set of ratings, rather than a problem limited
to a small number of raters. To examine the possibility of a
ratee effect on logical inconsistency errors, an eta
coefficient between ratees and the total number of logical
errors per rater was calculated. Eta-squared was only .01
(n.s.), suggesting very little (if any) relationship between
teachers and the tendency to make logical errors in evaluating
them.
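
The eta coefficient mentioned above can be illustrated with a
minimal sketch. Eta-squared is the share of the total variance in
a score (here, logical errors per rater) that lies between groups
(here, ratees). The error counts and the function name below are
hypothetical and serve only as an illustration; they are not the
study's data or its actual computation.

```python
def eta_squared(groups):
    """Return SS_between / SS_total for a list of groups of scores."""
    scores = [x for group in groups for x in group]
    grand_mean = sum(scores) / len(scores)
    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    return ss_between / ss_total


# Hypothetical data: number of logical errors (0 to 4) made by each
# rater, grouped by the ratee (instructor) being evaluated.
errors_by_ratee = [
    [2, 3, 1, 2],
    [1, 2, 3, 3],
    [2, 2, 3, 1],
]
print(round(eta_squared(errors_by_ratee), 3))
```

A value near zero, as in the reported .01, means that knowing
which ratee was evaluated tells us almost nothing about how many
logical errors a rater made.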

It seems reasonable to conclude, then, that the problem of
logical inconsistency errors is not one which can be primarily
attributed to individual differences in raters or ratees, but
is more likely associated with the instrument and the way that
raters respond to a mixed standard scale format. The magnitude
of the problem is such that research directed at understanding
the conditions which influence the manner in which raters
respond to performance appraisal instruments with disguised
continua is necessary if we are to have any confidence in the
ratings derived from such scales. If the major source of
variance is motivational, it might be reasonable to suggest
training or some similar intervention as a strategy for
decreasing the problem. On the other hand, if the source of
the problem is related to cognitive strategies which are
typically used by raters in processing and evaluating
performance information, we may find that the wiser course
would be to modify the process of recording performance
evaluations so that it is more compatible with rater
cognitive strategies.

The observation that standard extremity may affect the
rank-ordering of ratees, both for individual performance
dimensions and for the overall mean of dimension ratings,
indicates that developmental procedures can affect
personnel decisions based on performance ratings. The effect
of standard extremity on interindividual comparisons is not a
simple one, and we can offer no straightforward explanation for
why rank orderings change in the way that they do. We also
have no reason to believe that one rank-ordering is more
accurate than another. However, the mere fact that rank orders
change as a function of anchor selection procedures suggests
that organizations need to pay close attention to the
underlying scale values of standards chosen to represent
effective and ineffective performance when developing mixed
standard scales. This is particularly important when several
different forms will be developed (e.g., rating scales tailored
to particular groups of job titles), and when important
decisions (e.g., selection for promotion) will be based upon a
rank-ordering of employees according to the performance ratings
assigned to them.

By logical consistency, we mean that responses conform to
Guttman scale assumptions. Any set of responses to items
on a Guttman scale which forms a pattern that does not
conform with those assumptions is said to be logically
inconsistent. In the context of mixed standard scales,
each rating is derived from the pattern of responses (+,
0, or -) to three statements representing different levels
of performance effectiveness. There are 27 possible
response combinations to each set of three statements.
Seven of those response combinations match the patterns of
responses that would be expected if the statements formed
a Guttman scale; they are said to be logically consistent.
The remaining 20 response combinations do not match those
patterns. When any of those response combinations is
observed, it is referred to as a logical inconsistency
error.
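
The counts of 27, 7, and 20 can be checked with a short sketch.
It assumes the usual reading of the responses: "+" means the
ratee is better than the statement, "0" means the statement fits,
and "-" means the ratee is worse, with the three statements
ordered from high to low effectiveness. Under that assumption a
pattern is consistent when the coded responses never decrease
from the high to the low statement and at most one statement
"fits". The 1-to-7 score is likewise an assumed conventional
scoring for consistent patterns; how inconsistent patterns were
scored is not shown here, so the sketch leaves them unscored.

```python
from itertools import product

CODES = {"+": 1, "0": 0, "-": -1}


def is_consistent(pattern):
    """pattern: responses to the (high, medium, low) statements."""
    values = [CODES[r] for r in pattern]
    non_decreasing = values[0] <= values[1] <= values[2]
    # A ratee can match at most one point on a single continuum.
    return non_decreasing and list(pattern).count("0") <= 1


def rating(pattern):
    """Assumed 1-7 score for consistent patterns; None otherwise."""
    if not is_consistent(pattern):
        return None
    return 4 + sum(CODES[r] for r in pattern)


all_patterns = list(product("+0-", repeat=3))
consistent = [p for p in all_patterns if is_consistent(p)]
inconsistent = [p for p in all_patterns if not is_consistent(p)]

print(len(all_patterns), len(consistent), len(inconsistent))
for p in sorted(consistent, key=rating, reverse=True):
    print("".join(p), rating(p))
```

For example, the pattern "-0+" (worse than the high statement,
matching the medium one, better than the low one) is consistent,
while "+0-" is not.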

Table 1

Scale Values for Standards Used to Create Mixed Standard Scales

Performance                       Performance Dimension
Effectiveness                    Ability to                Interpersonal
Level                Delivery    Motivate      Knowledge   Relations

High:      HE          6.4         6.4           6.4         6.5
           ME          5.9         6.0           5.7         5.8
           LE          5.0         5.0           4.2         4.9

Moderate:  All         3.8         3.8           3.8         3.9
           conditions

Low:       LE          2.7         2.4           3.0         2.9
           ME          2.3         2.0           2.1         2.1
           HE          1.5         1.6           1.4         1.3

HE = High Extremity condition
ME = Moderate Extremity condition
LE = Low Extremity condition

Note: For examples of the kinds of statements used as standards for each
dimension, see Harari & Zedeck (1973) or Landy & Barnes (1979).


Table 2

Cell Means and Univariate F-tests for the Effects of Standard Extremity
on Performance Ratings

Performance                 ANOVA Summary Table                   Cell Means
Dimension        Source        SS      df     MS       F    p-level   HE     ME

Delivery         Extremity    11.02     2    5.51    2.42   < .09    4.11   4.36
                 Residual    554.05   243    2.28
                 Total       565.07   245    2.31

Ability to       Extremity    62.82     2   31.41   20.68   < .001   3.88   5.10
Motivate         Residual    369.05   243    1.52
                 Total       431.87   245    1.76

Depth of         Extremity    39.02     2   19.51    8.92   < .001   4.57   4.99
Knowledge        Residual    531.65   243    2.19
                 Total       570.67   245    2.33

Interpersonal    Extremity    10.98     2    5.47    2.91   < .06    4.84   5.33
Relations        Residual    457.59   243    1.88
                 Total       468.57   245    1.91

HE = High Extremity condition
ME = Moderate Extremity condition
LE = Low Extremity condition
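
As a reading aid for the ANOVA summary, each F value is the ratio
of the Extremity mean square to the Residual mean square, where a
mean square is the sum of squares divided by its degrees of
freedom. The sketch below recomputes F from the SS and df entries
of Table 2; the function and variable names are illustrative
only.

```python
def f_ratio(ss_effect, df_effect, ss_residual, df_residual):
    """F = (SS_effect / df_effect) / (SS_residual / df_residual)."""
    ms_effect = ss_effect / df_effect
    ms_residual = ss_residual / df_residual
    return ms_effect / ms_residual


# (SS Extremity, df Extremity, SS Residual, df Residual) from Table 2.
table2 = {
    "Delivery": (11.02, 2, 554.05, 243),
    "Ability to Motivate": (62.82, 2, 369.05, 243),
    "Depth of Knowledge": (39.02, 2, 531.65, 243),
    "Interpersonal Relations": (10.98, 2, 457.59, 243),
}

for dimension, entries in table2.items():
    print(f"{dimension}: F = {f_ratio(*entries):.2f}")
```

The recomputed ratios agree with the tabled F values to within
rounding.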


Table 3

Cell Means and Univariate F-tests for the Effects of Standard
Extremity on Performance Ratings: Logical Responses Only

Performance                 ANOVA Summary Table                      Cell Means
Dimension        Source        SS      df     MS       F    p-level   HE     ME     LE

Delivery         Extremity    33.12     2   16.56    5.32   < .01    4.09   5.06   5.4^
(N = 105)        Residual    317.51   102    3.11
                 Total       350.63   104    3.37

Ability to       Extremity    35.87     2   17.94   10.14   < .001   4.27   6.45   6.60
Motivate         Residual    161.03    91    1.77
(N = 94)         Total       196.90    93    2.12

Depth of         Extremity    39.22     2   19.61    9.66   < .001   5.05   5.97   6.42
Knowledge        Residual    229.33   113    2.03
(N = 116)        Total       268.55   115    2.34

Interpersonal    Extremity     2.94     2    1.47     .95   > .10    5.97   6.18   6.11
Relations        Residual    151.98    98    1.55
(N = 101)        Total       154.91   100    1.55


Table 4

Cell Means and Univariate F-tests for the Effects of Standard Extremity
on Proportion of Logical Errors

Performance                 ANOVA Summary Table                      Cell Means
Dimension        Source        SS      df     MS       F    p-level   HE     ME     LE

Delivery         Extremity      .08     2     .04     .17   > .10     .60    .55    .57
                 Residual     60.61   245     .25
                 Total        60.69   247     .25

Ability to       Extremity     3.06     2    1.53    6.75   < .001    .49    .60    .76
Motivate         Residual     55.55   245     .23
                 Total        58.61   247     .24

Depth of         Extremity      .50     2     .25     .99   > .10     .52    .59    .48
Knowledge        Residual     61.25   245     .25
                 Total        61.74   247     .25

Interpersonal    Extremity      .02     2     .01     .03   > .10     .60    .59    .58
Relations        Residual     60.03   245     .25
                 Total        60.04   247     .24



Table 5

Association between Rank-orderings Produced
by Different Experimental Conditions

Performance Dimension               Rank-order Correlation
                              HE and ME    ME and LE    HE and LE

Delivery                         .71*         .24          .14
Ability to Motivate              .90**        .43          .33
Depth of Knowledge              -.11         -.24          .43
Interpersonal Relations          .33          .?4          .33
Mean Overall Rating              .71*         .30          .40

*p < .05
**p < .01
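
To make the idea of agreement between rank-orderings concrete,
the sketch below computes a rank-order correlation between two
sets of ratee means. It uses Spearman's rho as one common choice;
that is an assumption, since the coefficient used for Table 5 is
not legible here. The instructor means are hypothetical and the
function names are illustrative.

```python
def ranks(values):
    """Rank values from 1 (highest) to n; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical mean ratings of five instructors under two conditions.
he_means = [4.1, 3.6, 5.2, 4.8, 3.9]
le_means = [5.9, 6.1, 6.4, 5.7, 6.0]

print(ranks(he_means))
print(ranks(le_means))
print(round(spearman_rho(he_means, le_means), 2))
```

A coefficient near 1 means the two conditions order the ratees
almost identically; values near zero mean that a ratee's standing
under one set of standards says little about his or her standing
under the other.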